Abstract

Background: Though the annotation of digital artifacts with metadata has a long history, the bulk of that work
focuses on the association of single terms or concepts to single targets. As annotation efforts expand to capture 
more complex information, annotations will need to be able to refer to knowledge structures formally defined in 
terms of more atomic knowledge structures. Existing provenance efforts in the Semantic Web domain primarily 
focus on tracking provenance at the level of whole triples and do not provide enough detail to track how 
individual triple elements of annotations were derived from triple elements of other annotations. 

Results: We present a task- and domain-independent ontological model for capturing annotations and their linkage 
to their denoted knowledge representations, which can be singular concepts or more complex sets of assertions. 
We have implemented this model as an extension of the Information Artifact Ontology in OWL and made it freely 
available, and we show how it can be integrated with several prominent annotation and provenance models. We 
present several application areas for the model, ranging from linguistic annotation of text to the annotation of 
disease-associations in genome sequences. 

Conclusions: With this model, progressively more complex annotations can be composed from other annotations, 
and the provenance of compositional annotations can be represented at the annotation level or at the level of 
individual elements of the RDF triples composing the annotations. This in turn allows for progressively richer 
annotations to be constructed from previous annotation efforts, the precise provenance recording of which 
facilitates evidence-based inference and error tracking. 

Keywords: Ontology, Conceptual data modeling, Annotation, Markup, Provenance, OWL, RDF



Background 

Annotation of artifacts such as documents and images 
with metadata is a scholarly practice with a long history. A 
wide variety of annotations have been represented in a 
wide range of formats. In the bulk of that work, each an- 
notation consists of a basic association of one conceptual 
resource (e.g., an ontology class, schema element, database 
identifier) with one target (e.g., document, text span, data- 
base entry) via an explicit or implicit relationship. Single- 
concept annotations have proven very useful, for example, 
in computing term enrichment [1] or for indexing for 
search [2]; however, they do not provide a detailed repre- 
sentation of the content they are describing. As information 
needs increase and annotation efforts expand to capture
more complex information, complex knowledge structures 
formally defined in terms of more atomic knowledge struc- 
tures will need to be represented. 

Where they exist, more structured annotations tend to 
be represented in ad hoc formats suited to one particular 
type of annotation or task but are not broadly applicable 
or interoperable. Several prominent annotation models 
not limited to specific types of tasks or information have 
been created, and components that enable annotations 
to denote knowledge structures more complex than 
atomic concepts have been added very recently to these 
models [3,4]. Yet there have been no mechanisms put 
forth by which these more complex annotations can 
refer to other annotations and by which their prov- 
enance can be unambiguously recorded. There have also 
been prominent efforts in scientific workflow prov- 
enance [5,6]. That work, however, primarily focuses on 
annotating experimental data, typically annotating lists
of identifiers or numeric data with their origins, not on 
annotating with dynamically composed and compo- 
sitional knowledge structures. 

An effective annotation model, in addition to being ap- 
plicable to many annotation use cases and supporting 
the specification of complex knowledge structures, needs 
to be able to unambiguously represent annotation prov- 
enance. While ontologies strive to be complete, it is 
likely that specific applications will require dynamic con- 
struction of concepts, either through data-driven methods 
[7] or compositional concept formation [8]. To support 
and document the provenance of these more complex 
annotations, annotators (both human and computational) 
need the ability to refer to existing annotations as the basis 
of more complex annotations. For example, in the lin- 
guistic domain, an annotation representing part of a 
syntactic parse tree may wish to build upon existing 
token or part-of-speech annotations. Similarly, in the 
biomedical domain, a protein interaction event annotation 
may wish to leverage existing annotations identifying spe- 
cific proteins. As annotation efforts become more ambi- 
tious, they will naturally build upon previous annotation 
efforts, and tracking the provenance of constructed know- 
ledge representations being used for annotations at a fine- 
grained level will be important to facilitate inference and 
error analysis. 

This paper proposes a task- and domain-independent 
formal ontological model for the creation of annotations 
and their linkage to their denoted knowledge representa- 
tions, which can be singular concepts or more complex 
knowledge in the form of sets of RDF assertions. With this 
model, progressively more complex annotations can be 
composed from other annotations, and this provenance 
can be unambiguously represented at either a coarse- or 
fine-grained level. We have designed our annotation 
model to be generic so as to facilitate the concurrent use 
of multiple types of annotations (e.g., syntactic annotation
and semantic annotation). Additionally, it allows for the 
creation of arbitrarily complex annotations, both in terms 
of their denoted knowledge and of any other annotations 
upon which they rely. All of this information can be loss- 
lessly recorded, facilitating inference and error tracking in 
large computational annotation efforts. We have imple- 
mented this model as an extension of the Information 
Artifact Ontology in OWL and made it freely available. 
We also show how it can be integrated with several other 
prominent generic annotation models. 

Results 

Overview 

A central aspect of our model is the capability to accurately 
capture the provenance of annotations, in terms of precursor
annotations, created by a human or computational annotator.
The annotation model we present here is generally applicable
to arbitrarily complex, structured annotations applied to any
content in any context. It is not
specific to text annotations, although our primary use cases 
are related to understanding biomedical text. 

Our model provides two key contributions above exist- 
ing annotation and provenance models. First, we provide 
a generic model for complex and compositional annota- 
tions that extends existing general-purpose annotation 
models. Second, we provide a model for documenting 
the provenance of the construction of the triples used as 
the denoted knowledge representations by these annota- 
tions. Our model goes beyond modeling the provenance 
of whole triples (for which there are sufficient existing 
methods, as discussed in the Related Work section) and 
extends the provenance modeling to document the 
source of individual statement elements that are used to 
construct triples. 

This proposal is neutral with respect to annotation 
template; i.e., the choice of terminologies, ontologies, or 
schemas used for annotation and the nature of the de- 
noted knowledge representations is left to the annotator. 
Several existing annotation models handle the associ- 
ation of annotations to text or other targets, and we dis- 
cuss the integration of their representations with our 
model in Additional file 1 and Additional file 2. Addi- 
tionally, as this proposal focuses on the linkage of anno- 
tations to their denoted knowledge representations and 
on the provenance of these knowledge representations, 
details about the recording of other types of annotation 
metadata such as author and creation date (for which 
there are existing proposals, e.g., [3,4,9]) are largely 
elided from this paper. Finally, our ontological model is 
neutral with respect to the methodology by which any 
such annotations are created. 

We reuse or extend existing community-curated on- 
tologies where possible, and we therefore present our 
proposal as an extension of the Information Artifact 
Ontology (IAO), which is a member of the Open Biomedical
Ontologies library of ontologies [10] (though
not all of the concepts of these ontologies are specific to
the biomedical realm). The IAO focuses on the representation
of types of information content entities, which
are defined to stand "in relation of aboutness" to other
entities; that is, an information content entity is in some
way "about" some other concept(s). For example, within
the biomedical domain, data, images, and text are all in
some way about sets of biomedical concepts. The IAO
provides a hierarchy of types of information content entities
as well as types of aboutness, including denotation,
in which the information content entity specifically refers
to some other concept (e.g., the word "apple" denotes
either a specific apple or the more general concept
of an apple). We hold that an annotation is a type of
information content entity, as it is in some way about
the entity it is annotating. We are engaged in the ongoing
process of submitting our model to the IAO for
inclusion. An OWL representation of our model as an
extension of the IAO is provided in Additional file 3.

Namespace and notation 

Our in-house knowledge base of biomedicine (KaBOB) 
is the aggregator of our work. KaBOB extensions of an 
ontology are named by prefixing the ontology's name- 
space with the letter 'k'; the namespace kiao: is therefore 
used for our extension of the IAO (whose namespace is
iao:), and the ex: namespace is used for examples. In this 
document, fixed-width font will be used to identify 
concepts. Class names begin with a capital letter, while in- 
stance and property names begin with a lowercase letter. 
Additionally, instances are named mnemonically with letters
corresponding to their class names; e.g., instances of
the class RdfResourceAnnotation have names
starting with "ra". RDF triples and quads are presented
using an abbreviated n-triple/quad format for readability,
using namespace-abbreviation:local-name
instead of full URIs.

Representation of annotations 

We have created a top-level Annotation class, defin- 
ing an annotation as an information content entity that 
is used to concisely describe, comment on, or otherwise 
make an assertion or set of assertions about an existing 
information content entity. Thus, for example, a linguis- 
tic part-of-speech tag can be used to annotate a word 
within a piece of text to describe its syntactic or morphological
behavior; a Java keyword (e.g., @deprecated)
can be used to annotate a segment of Java
source code to specify a property of a Java class, method, 
variable, parameter, or package; and a GO term can be 
used to annotate a digital representation of a gene or 
gene product to make an assertion about some aspect of 
the biological functionality of the latter. Conciseness 
seems to be a common trait among the many types of 
annotations we have considered, so, e.g., a book written 
about a poem would seem to be beyond the bounds of 
what most would consider an annotation. Additionally, 
an annotation provides additional information about an 
entity but is typically not fundamental to the entity; 
therefore, we would not consider a title of a journal art- 
icle to be an annotation of the article: Even though it 
concisely describes an existing information content en- 
tity, i.e., the body of the journal article, it is a canonically 
required part of the article. Furthermore, an annotation 
is typically either incorporated into the entity (e.g., in the
classical case of annotation of writing in the margins of 
a book, which becomes a physical part of the book) or 
can be otherwise retrieved along with the entity it is
annotating (e.g., in the case of GO-term annotations of
database entries of proteins). 

A subclass of Annotation could be defined for any 
type of information content entity used to annotate another 
entity (e.g., PartOfSpeechTagAnnotation, JavaKeywordAnnotation,
GoTermAnnotation). However,
since we are motivated toward utility for the Semantic
Web, we introduce only two subclasses, RdfResourceAnnotation
and RdfGraphAnnotation, representing
RDF resources and graphs, respectively, that are
used to annotate other information content entities. These 
two subclasses should be all that is needed for the repre- 
sentation of annotations in RDF stores, in which every- 
thing should be an RDF resource or graph. Furthermore, 
as long as information content entities used for annotation 
are offered as RDF constructs (so that they can be used in 
RDF stores), other annotation subclasses should not be 
needed for their representation in RDF stores. For ex- 
ample, since GO terms are also offered as RDF resources, 
GO-term annotations can be stored as instances of 
RdfResourceAnnotation, obviating the need for a
GoTermAnnotation class (unless there is further de- 
sired axiomatization for GO-term annotations). 

Resource annotations 

In our model, a resource annotation is an annotation 
that associates a single rdfs:Resource with a target.
A resource annotation is modeled as rdf:type
kiao:RdfResourceAnnotation. The relation iao:denotes
is used to associate a given annotation with the
concept being used to annotate the target. This property 
relates an information content entity (in this case a re- 
source annotation) to something to which it is specific- 
ally intended to refer. 

One of the primary types of text annotation is syntactic 
annotation, which is often produced by text mining systems 
(e.g., [11]). To demonstrate the applicability of our model to
syntactic annotation, we use a fragment of the example 
sentence used by Liu et al. in their study of dependency 
parsing for information extraction [12], i.e., the phrase 
"Interferons inhibit activation of STAT6". (For the purposes 
of an example, we have taken some liberty in creating 
example classes and relations that we believe are faithful to
the native dependency parse representations [13].) 

Common tasks at the beginning of text mining pipelines 
include tokenization and part-of-speech tagging [14]. 
Figure 1 depicts four resource annotations: ra1, ra2,
ra3, and ra4. The concepts in the object positions of the
denotes assertions are part of the domain model used
by the annotator and are not part of the proposed annotation
model itself. ra1 and ra2 denote specific instances
of tokens (represented here as instances of the class
Token), while ra3 and ra4 denote plural nouns and singular
present-tense verbs, respectively (represented here
by their Penn Treebank part-of-speech tags [15]). In this
example, the annotator made the domain-specific representational
choice to model the tokens as instances so that
they can be specifically referred to later by subsequent annotations,
as will be shown in the next section. Abstract
relations connecting the resource annotations to text
spans are shown in Figure 1 as gray arrows, with gray
brackets representing the text spans. Existing models for
linking annotations to the object being annotated can be
used with our model; for example, the relations oa:hasTarget
[3] or ao:context [4] could be used to model
these gray arrows. As our model is neutral relative to these
representational decisions, this aspect of modeling the
example annotations is elided from this document for
simplicity and clarity.

[Figure 1: annotated text "Interferons inhibit activation of STAT6..."]

Figure 1 Example syntactic annotations. This figure depicts five
syntactic annotations as bold ovals with underlined labels: four
RdfResourceAnnotation instances, each with the prefix "ra",
and one RdfGraphAnnotation instance, prefixed with "ga".
Rectangles represent classes, while instances have rounded corners.
Double-lined arrows depict basedOn assertions. Thin gray arrows
are used to provide reference to the text, although their representation
is elided in this paper. The statements inside brackets are contained
within the corresponding RDF graph.

The following are RDF triples representing two of these 
annotations, asserting that ra1 and ra3 are resource
annotations that denote a particular instance of a token
(represented here as t1) and plural nouns (represented
here as NNS), respectively.

ex:ra1 rdf:type kiao:RdfResourceAnnotation .
ex:ra1 iao:denotes ex:t1 .
ex:ra3 rdf:type kiao:RdfResourceAnnotation .
ex:ra3 iao:denotes ex:NNS .

Semantic annotations of text fragments, such as those 
in the CRAFT Corpus [16], are another primary use case 
for the model presented here. In Figure 2, the example
sentence fragment from Figure 1 has been annotated
with semantic classes in the manner of CRAFT annota- 
tion. (In some cases, we have used ontologies and classes 
not used in CRAFT in order to simplify the biology and 
therefore the example.) The biomedical classes and 
properties used to model the examples in this paper are 
not part of the proposed annotation model. In Figure 2, 
the three example resource annotations ra5, ra6, and 
ra7 denote relevant biological concepts: ra5 denotes 
interferons, a group of proteins represented here by 
Interferon (IPR000471) in the InterPro database
of protein sequence signatures and families [17]; ra6 denotes
the upregulation of biological processes, represented
here by positive regulation of biological
process (GO:0048518) in the Gene Ontology [18]; and
ra7 denotes STAT6 proteins, represented here by STAT6
(PR:000001933) in the Protein Ontology [19]. The
following are RDF triples for two of these annotations,
specifically asserting that ra6 and ra7 are resource
annotations that denote positive regulation of biological
processes (represented here as GO:0048518) and
STAT6 proteins (represented here as PR:000001933),
respectively. 

ex:ra6 rdf:type kiao:RdfResourceAnnotation .
ex:ra6 iao:denotes obo:GO_0048518 .
ex:ra7 rdf:type kiao:RdfResourceAnnotation .
ex:ra7 iao:denotes obo:PR_000001933 .
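
The resource-annotation pattern above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: triples are modeled as plain tuples of prefixed names rather than in an RDF store, and the helper names resource_annotation and denoted_by are hypothetical, not part of the proposed model.

```python
# Triples are modeled as plain (subject, predicate, object) tuples of
# prefixed names; this is an illustrative stand-in for an RDF store.

def resource_annotation(annotation, denoted):
    """Return the two triples that describe one resource annotation."""
    return [
        (annotation, "rdf:type", "kiao:RdfResourceAnnotation"),
        (annotation, "iao:denotes", denoted),
    ]

# The two semantic annotations listed above (ra6 and ra7).
triples = (
    resource_annotation("ex:ra6", "obo:GO_0048518")
    + resource_annotation("ex:ra7", "obo:PR_000001933")
)

def denoted_by(annotation, triples):
    """Look up the resource(s) that a given annotation denotes."""
    return [o for s, p, o in triples if s == annotation and p == "iao:denotes"]

# Print the triples in the abbreviated n-triple style used in this paper.
for s, p, o in triples:
    print(f"{s} {p} {o} .")
```

Because every resource annotation has the same two-triple shape, no annotation-type-specific class is needed to store it.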

Graph annotations 

While a resource annotation relies on a single RDF 
resource for annotation, a graph annotation is an RDF 
graph, composed of a set of one or more RDF state- 
ments, that is being used to annotate another informa- 
tion content entity. A graph annotation is modeled as 
rdf:type kiao:RdfGraphAnnotation. A graph
annotation is connected to a named graph of RDF statements
using the property iao:denotes. While a graph
annotation is directly linked to a named graph, it
actually denotes the content of the named graph (i.e., the
RDF graph that the named graph encodes or represents)
and not the named graph itself; this is consistent with
the semantics of named graphs proposed by Carroll
et al. [20], which states that any assertion in RDF about
the graph structure of a named graph is understood to
refer to the underlying RDF graph. As before, the nature
of the denoted knowledge representations (i.e., the set of
RDF statements) is left to the user, as our metamodel 
focuses on the linkage of annotations to such represen- 
tations and, as presented in the next section, the prov- 
enance of compositional annotations. 

Figure 2 Example biomedical semantic annotations. This figure depicts five semantic annotations as bold ovals with underlined labels: three
RdfResourceAnnotation instances and two RdfGraphAnnotation instances. (See the caption of Figure 1 for explanation of shapes
and arrows).

Linguistic annotation is frequently done in a pipeline
where subsequent stages build upon the annotations produced
by earlier stages. In addition to the aforementioned
resource annotations, Figures 1 and 2 depict several
RdfGraphAnnotation instances. In Figure 1, there is
one graph annotation that denotes the subject dependency
of token t2 on token t1, represented here as a graph
(g1) containing one RDF triple with subject t2, property
hasSubjectDependent, and object t1. The following
are triples/quads for graph annotation ga1:

ex:ga1 rdf:type kiao:RdfGraphAnnotation .
ex:ga1 iao:denotes ex:g1 .
ex:t2 ex:hasSubjectDependent ex:t1 ex:g1 .

Since the annotator made the domain-specific modeling
choice to represent resource annotations ra1 and
ra2 as denoting instances of tokens, a dependency
assertion among them (i.e., that token t2 syntactically
depends on token t1 as the subject of the sentence, as
seen in Figure 1) could be created. If the annotator
had pointed these resource annotations directly to the
class Token (analogous to the direct pointing of resource
annotations ra3 and ra4 to the classes NNS and
VBP, respectively), then it could only have been asserted
at the class level that Token hasSubjectDependent
Token, rather than asserting the relation between the specific
tokens. It is important to remember that this representation
of syntactic dependency is of our own choosing for this example
and that any user of our metamodel of annotations
is free to represent syntactic dependency (or any other
knowledge denoted by the annotations) as they choose.

Semantic concept annotation, such as the manual 
annotation performed on the CRAFT Corpus or annota- 
tions created by text mining systems, can also be built in 
layers. Figure 2 depicts two RdfGraphAnnotation instances,
ga2 and ga3. The former denotes the positive
regulation of STAT6 protein, represented here as a
graph g2 containing a dynamically constructed subclass
P1 of the GO class representing positive regulation
(GO:0048518) in which STAT6 (PR:000001933) is
regulated. The graph annotation ga3 builds upon the denoted
knowledge representation of ga2 and denotes the negative
regulation of the positive regulation of STAT6 protein
by an interferon, represented here as a dynamically
constructed subclass N1 of the GO class representing
negative regulation (GO:0048519) in which the regulating
entity is an interferon (IPR000471) and the regulated
process is a positive regulation of STAT6 protein.
The following are triples/quads for graph annotation
ga2:

ex:ga2 rdf:type kiao:RdfGraphAnnotation .
ex:ga2 iao:denotes ex:g2 .
ex:P1 rdfs:subClassOf obo:GO_0048518 ex:g2 .
ex:P1 kro:resultsInRegulationOf obo:PR_000001933 ex:g2 .

The use of graphs has the advantage of separating the
representation of the annotation from the representation 
of the denoted content and thus protects users of a 
given annotation from committing to or believing the 
propositions represented in the annotation unless 
desired. For example, a given annotation could denote 
the fact that STAT6 can bind calcium ions as one of its 
functionalities, which could be represented as, e.g., one 
RDF triple (STAT6 hasFunction CalciumIonBinding)
in an RDF graph. Since this triple is placed in
its own graph, a reader of this annotation is not commit- 
ted to believing that STAT6 can bind calcium ions; what 
has been effectively represented is that this particular 
annotation says that STAT6 can bind calcium ions. Just 
as our model is agnostic with respect to the denoted 
knowledge representations of annotations, we do not 
seek here to explicitly represent confidence, trust, or 
other epistemological or modal information; however, 
such information could be modeled independently and 
added orthogonally or compositionally to our proposed 
model. 
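
The separation between an annotation and its denoted content can be illustrated with a small Python sketch. It assumes, purely for illustration, that quads are plain tuples and that None stands for the default graph (the statements a store itself asserts); the function names are hypothetical. The quads are those given above for graph annotation ga2.

```python
# Quads are (subject, predicate, object, graph) tuples. DEFAULT (None) stands
# for the default graph, i.e., statements asserted by the store itself.
DEFAULT = None

quads = [
    ("ex:ga2", "rdf:type", "kiao:RdfGraphAnnotation", DEFAULT),
    ("ex:ga2", "iao:denotes", "ex:g2", DEFAULT),
    ("ex:P1", "rdfs:subClassOf", "obo:GO_0048518", "ex:g2"),
    ("ex:P1", "kro:resultsInRegulationOf", "obo:PR_000001933", "ex:g2"),
]

def graph_content(graph, quads):
    """Return the set of triples stored in one graph."""
    return {(s, p, o) for s, p, o, g in quads if g == graph}

def denoted_content(annotation, quads):
    """Collect the statements in every named graph the annotation denotes."""
    content = set()
    for s, p, o, g in quads:
        if s == annotation and p == "iao:denotes" and g is DEFAULT:
            content |= graph_content(o, quads)
    return content

asserted = graph_content(DEFAULT, quads)    # what the store commits to
denoted = denoted_content("ex:ga2", quads)  # what ga2 merely says
```

In this sketch the store asserts only that ga2 exists and denotes g2; the statements about P1 are reachable through the annotation but are never merged into the asserted set unless a consumer chooses to do so.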



Provenance of compositional annotations 

As annotations become more complex, tracking their 
provenance becomes increasingly important. Provenance 
tracking is necessary in order to know which other an- 
notations were used in constructing an annotation and 
how their individual denoted representations were com- 
posed into larger knowledge structures. Additionally, 
provenance is needed for error analysis and blame attri- 
bution. For example, referring to the example in Figure 2, 
if ga2 is found to be incorrect only because it refers to 
the wrong protein, the error and blame should be prop- 
erly attributed to the author of ra7, as the latter was 
the source of the incorrect protein identification. Like- 
wise if ra7 is determined to be incorrect, annotations 
dependent on that annotation could be identified and 
retracted or updated as well. 

Provenance information can be equally useful for disambiguation.
Consider the case in which there are multiple
competing annotations for the text "STAT6" (e.g.,
those denoting human, mouse, and rat homologs of the 
STAT6 protein, which are represented as distinct entities 
in prominent biological repositories such as UniProt 
[21] but are all canonically referred to as "STAT6"). If, as 
in this example, there are multiple competing annota- 
tions for the specific type of protein but only one is used 
as the provenance for a larger annotation, such as ga2, 
then this provenance can be tracked to resolve the ambi- 
guity caused by the competing annotations. This is one 
of the ways language understanding systems can 
successfully resolve ambiguity [22]. An annotation model
such as ours that captures this information and
provenance can be used to document these choices and
facilitate error analysis. 
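
The disambiguation scenario above can be sketched as follows. All identifiers here are hypothetical: the three competing annotation names, the span identifier, and the choice of the mouse homolog as the one relied upon are assumptions made only to have something concrete to run.

```python
# Hypothetical competing annotations of the same text span "STAT6".
# Keys are annotations; values are the (illustrative) span they annotate.
competing = {
    "ex:ra7_human": "ex:span_STAT6",
    "ex:ra7_mouse": "ex:span_STAT6",
    "ex:ra7_rat": "ex:span_STAT6",
}

# (subject, object) pairs read as "subject kiao:basedOn object".
# Which homolog ga2 relies on is an assumption for this example.
based_on = [("ex:ga2", "ex:ra7_mouse"), ("ex:ga2", "ex:ra6")]

def preferred(span, competing, based_on):
    """Return the competing annotations of `span` that another annotation relies on."""
    used = {source for _, source in based_on}
    return sorted(a for a, s in competing.items() if s == span and a in used)

choice = preferred("ex:span_STAT6", competing, based_on)
```

Only the annotation that a larger annotation was actually built on is returned, which is how recorded provenance documents the resolution of the ambiguity.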

In order to track provenance, we introduce the tran- 
sitive relation kiao:basedOn, which is used to track 
both coarse-grained, annotation-level provenance and fine- 
grained, statement-element-level provenance in our model. 
We propose that this relation holds between two 
information content entities; therefore, the value of
rdfs:domain and rdfs:range for this relation is
iao:information content entity. Informally,
kiao:basedOn holds between subject and object in- 
formation content entities when the subject entity has 
been created relying in whole or in part on the already 
existing object entity. In this proposal, we are interested in 
making assertions of annotation provenance by recording 
that specific annotations have been created wholly or 
partly relying on other specific annotations. We make no 
restriction on the cardinality of kiao:basedOn, so a 
subject information content entity can be based on mul- 
tiple object entities, and multiple subject information con- 
tent entities can be based on the same object entity. 

Annotation-level provenance 

The simplest way to record provenance is to make 
coarse-grained basedOn assertions between annota- 
tions. A basedOn statement can be made between two 
annotations either when there is a direct relationship be- 
tween the annotations, such as one directly using one or 
more elements of another, or when there is an indirect 
relationship, such as one being used as the justification 
for another's existence even though no part is explicitly 
shared (e.g., an annotation of a text span with a specific
protein class being used to justify an annotation of the
same text span with the top-level class protein
(PR:000000001) from the Protein Ontology).

Most syntactic dependency parsers use tokenization and 
part-of-speech tags produced by other annotation systems 
as input. Figure 1 depicts six different annotation-level 
basedOn assertions between syntactic resource annotations.
Those from ra3 to ra1 and from ra4 to ra2 have
been asserted because ra3 and ra4, denoting parts of
speech, were created based on ra1 and ra2, denoting
tokenization, respectively. The following triples represent
the assertions of provenance among these four resource
annotations:

ex:ra3 kiao:basedOn ex:ra1 .
ex:ra4 kiao:basedOn ex:ra2 .

Note that these assertions are made even though there
are no direct relationships among the denoted tokens
and parts of speech, e.g., between the concepts denoted
by ra3 and ra1 (i.e., between plural nouns, represented
here by the part-of-speech tag NNS, and token t1); however,
ra1 was instrumental in the creation of ra3, and
the basedOn statement documents this. The part of the
syntactic dependency parse annotated by the graph annotation
ga1 used both the tokenization annotations
(ra1 and ra2) and the part-of-speech annotations (ra3
and ra4) to determine that token t1 is the subject of
token t2, and thus ga1 is modeled with a basedOn relation
to each of these four resource annotations. The
following four triples represent the annotation-level
provenance of graph annotation ga1:

ex:ga1 kiao:basedOn ex:ra1 .
ex:ga1 kiao:basedOn ex:ra2 .
ex:ga1 kiao:basedOn ex:ra3 .
ex:ga1 kiao:basedOn ex:ra4 .
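
Because kiao:basedOn is transitive, some of these assertions are also inferable from others. The following Python sketch, which assumes a simple in-memory set of pairs rather than an OWL reasoner, computes the transitive closure of the six Figure 1 assertions; the function name is illustrative.

```python
# Annotation-level provenance from Figure 1 as (subject, object) pairs,
# read as "subject kiao:basedOn object".
based_on = {
    ("ex:ra3", "ex:ra1"), ("ex:ra4", "ex:ra2"),
    ("ex:ga1", "ex:ra1"), ("ex:ga1", "ex:ra2"),
    ("ex:ga1", "ex:ra3"), ("ex:ga1", "ex:ra4"),
}

def transitive_closure(pairs):
    """Add (a, d) whenever (a, b) and (b, d) are present, until nothing changes."""
    closure = set(pairs)
    while True:
        inferred = {(a, d) for a, b in closure for c, d in closure if b == c}
        if inferred <= closure:
            return closure
        closure |= inferred

closure = transitive_closure(based_on)
sources_of_ga1 = sorted(o for s, o in closure if s == "ex:ga1")

# Dropping the direct ga1 -> ra1 assertion loses nothing under transitivity,
# since it is recoverable through ga1 -> ra3 -> ra1.
recovered = transitive_closure(based_on - {("ex:ga1", "ex:ra1")})
```

Asserting the direct links explicitly, as done in the triples above, remains useful for stores that do not perform this inference.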

Just as layers of annotation can build upon each other 
in syntactic annotation, semantic annotation can also be 
built compositionally. Provenance relations are
analogously depicted among the semantic annotations in 
Figure 2. Graph annotation ga2 was built using infor- 
mation from resource annotations ra6 and ra7. Simi- 
larly, the larger graph annotation ga3 records that it 
was built using information from resource annotation 
ra5 and from graph annotation ga2. The provenance 
information can be traced from annotation to annota- 
tion, and in this case one can see that ga3 is (partly) 
based on ga2, which in turn is based on ra6 and ra7.
It is important to note that basedOn is general in that 
it can be used to create an annotation-level assertion of 
provenance from either a resource or graph annotation 
to a set of any combination of resource and/or graph annotations.
Instances of RdfGraphAnnotation need
not be strictly compositional; that is, the statements in a
graph annotation do not need to be based on other an- 
notations and can incorporate new information not yet 
annotated elsewhere. For example, ga3 uses additional 
information not explicitly previously annotated, which is 
shown in Figure 2 by an additional gray arrow pointing 
to a segment of the text not previously annotated. 
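
Tracing provenance in the opposite direction supports the error-tracking use case: if a source annotation is found to be wrong, every annotation that relies on it can be located. The sketch below uses the Figure 2 relations described in this section; the worklist traversal and the function name are illustrative choices, not part of the model.

```python
from collections import deque

# Figure 2 provenance as (subject, object) pairs,
# read as "subject kiao:basedOn object".
based_on = [
    ("ex:ga2", "ex:ra6"), ("ex:ga2", "ex:ra7"),
    ("ex:ga3", "ex:ra5"), ("ex:ga3", "ex:ga2"),
]

def dependents(source, based_on):
    """Return every annotation that relies, directly or indirectly, on `source`."""
    found, queue = set(), deque([source])
    while queue:
        current = queue.popleft()
        for subject, obj in based_on:
            if obj == current and subject not in found:
                found.add(subject)
                queue.append(subject)
    return found

# Annotations to review or retract if ra7 turns out to be incorrect.
affected = dependents("ex:ra7", based_on)
```

Here an error in ra7 reaches ga2 directly and ga3 through ga2, whereas an error in ra5 would affect only ga3.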

Statement-element-level provenance 

The second type of provenance represented in our 
model records detail at a more fine-grained level. Refer- 
ring back to Figure 2, while the assertion ga2 basedOn 
ra7 is sufficient to model that at least some part of ga2 
was based on ra7, it does not capture which elements 
of ga2 are based on ra7. Analogously, in Figure 1, the
graph annotation ga1 documents that it is based on resource
annotation ra1 but nothing more specific than
this. If the author of ga1 wishes to document that the
object element (which denotes token t1) of the RDF
statement of ga1 is based on ra1 (which also denotes
token t1), then recording provenance at the annotation
level is insufficient. In addition to documenting how
compositional annotations were constructed for under- 
standing or training purposes, this type of provenance is 
necessary to perform detailed error analysis. 

In RDF, the typical way to make statements about statements
is to reify the statement itself as an instance of rdf:Statement.
An RDF statement identifies its subject,
property, and object via the relations rdf:subject, rdf:predicate,
and rdf:object, respectively. However,
RDF statements and their elements are conceptual representations;
for example, in Figure 1, the RDF statement t2
hasSubjectDependent t1 represents the assertion
that token t2 has as its subject token t1. To explicitly
represent RDF statements as information content entities,
we introduce the class kiao:RdfStatement, which
is rdfs:subClassOf iao:information content
entity. An example of a reified kiao:RdfStatement
is the instance s1 in Figure 3. A graph annotation can then
be connected to each reified statement of the graph annotation
using the property obo:has_part.



[Figure 3 graphic; annotated text: "Interferons inhibit activation of STAT6..."]

Figure 3 Example of statement-element provenance. This figure
depicts an example of statement-element-level provenance. The
RdfStatement and the three RdfStatementElement
instances have bold ovals and underlined labels. This figure is an
extension of Figure 1, and some of the parts of that figure have
been preserved here but grayed out. Dashed lines show assertions
that can be inferred. (See the caption of Figure 1 for explanation of
shapes and arrows).

In order to record the provenance of individual
parts of a statement, these parts must also be reified as
instances of kiao:RdfStatementElement. A reified
statement is linked to its component instances of
kiao:RdfStatementElement using three properties
that mirror the properties used to reify the
rdf:Statement itself (i.e., rdf:subject,
rdf:predicate, and rdf:object): kiao:subjectElement,
kiao:propertyElement, and kiao:objectElement,
each of which is rdfs:subPropertyOf
obo:has_part; that is, a reified statement has
these subject, property, and object elements as parts.
Two reified instances of kiao:RdfStatementElement,
se1 and se2, can be seen in Figure 3. The
corresponding iao:denotes assertions from these
statement elements to their denoted concepts (i.e., tokens t1
and t2, respectively) are also depicted.

With an assertion from a graph annotation to a
statement via obo:has_part and another assertion from the
statement to a statement element via kiao:subjectElement,
kiao:propertyElement, or kiao:objectElement
(each a subproperty of obo:has_part), an
obo:has_part assertion from the graph annotation to
the statement element can also be inferred. The following
axiom holds for subject elements, and corresponding
axioms hold for property and object elements:

(?ga obo:has_part ?se) <-
    (?ga obo:has_part ?s) &&
    (?s kiao:subjectElement ?se)

This is simply applying obo:has_part transitively.
Figure 3 shows two derived obo:has_part assertions:
Since graph annotation ga1 has_part statement s1,
and s1 is linked to its component statement elements
se2 and se1 via subjectElement and objectElement,
respectively, it can be inferred that ga1 has
these statement elements as parts.
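
The inference rule above can be illustrated with a minimal Python sketch. It is not part of the published model or its OWL implementation: triples are represented as plain string tuples, and the function name infer_has_part is a hypothetical helper chosen for this illustration. A real deployment would more likely rely on an OWL or rule-based reasoner.

```python
# Illustrative sketch only: triples are (subject, predicate, object) string tuples.
ELEMENT_PROPERTIES = {
    "kiao:subjectElement",
    "kiao:propertyElement",
    "kiao:objectElement",
}


def infer_has_part(triples):
    """Return obo:has_part triples implied by the axiom but not already asserted."""
    triples = set(triples)
    inferred = set()
    for ga, p1, s in triples:
        if p1 != "obo:has_part":
            continue
        # Any element linked to statement s is also a part of ga.
        for s2, p2, se in triples:
            if s2 == s and p2 in ELEMENT_PROPERTIES:
                inferred.add((ga, "obo:has_part", se))
    return inferred - triples


# The asserted triples of Figure 3 that matter for this rule.
figure3 = [
    ("ex:ga1", "obo:has_part", "ex:s1"),
    ("ex:s1", "kiao:subjectElement", "ex:se2"),
    ("ex:s1", "kiao:objectElement", "ex:se1"),
]

for triple in sorted(infer_has_part(figure3)):
    print(triple)
```

Run on the Figure 3 triples, the sketch derives the two dashed has_part links from ga1 to se1 and se2.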

Now that the reified statement has been decomposed
into and appropriately linked to reified
RdfStatementElement instances, the provenance
of these individual pieces can be recorded. In the
example in Figure 3, se1 is documented as being based
on resource annotation ra1, and se2 is documented
as being based on ra2. This is analogous to the use
of basedOn among annotations, except in this
case the relation is being used among more
fine-grained components of the model.

Our original model [23] used a more complex set of
properties to record the same amount and type of
information. The model proposed in this paper significantly
simplifies the representation of this information and
reduces the cognitive load of using the model. The
following triples represent statement s1 decomposed into
statement elements se1, se2, and se3, along with the
provenance of se1, as rendered in Figure 3:

ex:ga1 obo:has_part ex:s1 .
ex:s1 rdf:type kiao:RdfStatement .
ex:s1 kiao:propertyElement ex:se3 .
ex:se3 iao:denotes ex:hasSubjectDependent .
ex:s1 kiao:subjectElement ex:se2 .
ex:se2 iao:denotes ex:t2 .
ex:s1 kiao:objectElement ex:se1 .
ex:se1 iao:denotes ex:t1 .
ex:se1 kiao:basedOn ex:ra1 .



The first two triples above represent that graph
annotation ga1 has statement s1 as a part and that s1 is
an RDF statement, and the next six triples decompose
s1 into RdfStatementElement instances and specify
their denotations. The relations subjectElement,
propertyElement, and objectElement have an
rdfs:domain of kiao:RdfStatement and an rdfs:range
of kiao:RdfStatementElement, and thus
type information can be inferred using RDFS reasoning;
we have omitted these inferable type triples for
conciseness. The seventh and eighth triples reify the
object position of this statement, and the final triple
documents that this statement element is based on
resource annotation ra1. In this way, the annotator
constructing ga1 can explicitly document the origin of
every component piece.
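
As a rough illustration of how such triples might be generated programmatically, the following Python sketch builds the nine triples for s1. The function reify_statement and its argument layout are assumptions made for this example and do not come from the published model or its tooling; only the property and class names are taken from the text above.

```python
# Illustrative sketch only: reify_statement is a hypothetical helper.
def reify_statement(graph_annotation, statement, elements, based_on=None):
    """Build reification triples for one statement of a graph annotation.

    elements maps each role ("subject", "property", "object") to a pair
    (statement element identifier, denoted resource).
    based_on maps statement element identifiers to their provenance source.
    """
    based_on = based_on or {}
    triples = [
        (graph_annotation, "obo:has_part", statement),
        (statement, "rdf:type", "kiao:RdfStatement"),
    ]
    for role in ("subject", "property", "object"):
        element, denoted = elements[role]
        triples.append((statement, f"kiao:{role}Element", element))
        triples.append((element, "iao:denotes", denoted))
        if element in based_on:
            # Statement-element-level provenance is optional per element.
            triples.append((element, "kiao:basedOn", based_on[element]))
    return triples


s1_triples = reify_statement(
    "ex:ga1",
    "ex:s1",
    {
        "subject": ("ex:se2", "ex:t2"),
        "property": ("ex:se3", "ex:hasSubjectDependent"),
        "object": ("ex:se1", "ex:t1"),
    },
    based_on={"ex:se1": "ex:ra1"},
)

for s, p, o in s1_triples:
    print(s, p, o, ".")
```

The sketch emits the same nine triples as the listing above, although in a different order (subject, property, then object elements).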

Just as RdfStatementElement instances can be
based on instances of RdfResourceAnnotation,
they can also be based on other instances of
RdfStatementElement. As the composition of
annotations becomes more complex and the layers of
annotation get deeper, graph annotations will build on other
graph annotations. This is especially true for annotations
produced and used by computational language
understanding systems [24]. For example, in Figure 4,
statement element se9 is based on statement element se6,
which is in turn based on resource annotation ra7.
Figure 4 only shows the provenance of statement
element se6 of statement s2 along with the provenance of
statement element se9 of statement s3. Although not
depicted, statement-element-level provenance could
analogously be recorded for all elements of these
statements, as well as for all statements of graph annotations
ga2 and ga3. The following are triples representing
annotation information for statements s2 and s3 and
statement elements se6 and se9, including their
provenance, rendered in Figure 4:

Figure 4 Extended example of statement-element provenance. This figure depicts an extended example of statement-element-level
provenance. One RdfStatement from each graph and the six RdfStatementElement instances have bold ovals with underlined labels.
This figure is an extension of Figure 2, and some of the parts of that figure have been preserved here but grayed out. Dashed lines show
assertions that can be inferred. (See the caption of Figure 1 for explanation of shapes and arrows).



ex:ga2 obo:has_part ex:s2 .
ex:s2 rdf:type kiao:RdfStatement .
ex:s2 kiao:subjectElement ex:se4 .
ex:se4 iao:denotes ex:Pl .
ex:s2 kiao:propertyElement ex:se5 .
ex:se5 iao:denotes kro:results_in_regulation_of .
ex:s2 kiao:objectElement ex:se6 .
ex:se6 iao:denotes obo:PR_000001933 .
ex:se6 kiao:basedOn ex:ra7 .

ex:ga3 obo:has_part ex:s3 .
ex:s3 rdf:type kiao:RdfStatement .
ex:s3 kiao:subjectElement ex:se7 .
ex:se7 iao:denotes ex:Pl .
ex:s3 kiao:propertyElement ex:se8 .
ex:se8 iao:denotes kro:results_in_regulation_of .
ex:s3 kiao:objectElement ex:se9 .
ex:se9 iao:denotes obo:PR_000001933 .
ex:se9 kiao:basedOn ex:se6 .



The first group of eight triples and the ninth triple 
above are exactly analogous to the triples used in the 
previous example. As in the previous example, here 
there is one reified statement (s2) that is part of a graph 
annotation (ga2). Analogously, the reified statement s3 
is a part of graph annotation ga3, as can be seen in the 
third group of (eight) triples. However, in this example, 
there is an extended statement-element-level assertion 
of provenance: In the last triple, a statement element
(se9) is asserted to be based on another statement
element (se6), which was already created to partly 
document the provenance of a graph annotation (ga2). 
In this way, the annotator creating graph annotation 
ga3 can unambiguously document the specific element 
of the specific statement of graph annotation ga2 from 
which its reference to the protein STAT6 derives. This 
low-level provenance is essential for understanding the 
dependencies between layers of complex composi- 
tional annotations and for being able to unwind these 
layers to perform tasks such as error analysis and blame 
attribution. 
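
To make the idea of unwinding these layers concrete, the following Python sketch follows kiao:basedOn links upstream from a statement element. It is a plain-Python stand-in under the assumption that triples are held as string tuples; the function name provenance_chain is hypothetical and not part of the model.

```python
# Illustrative sketch only: follows kiao:basedOn links upstream, breadth-first.
def provenance_chain(triples, start):
    """Return everything that start is directly or indirectly based on."""
    based_on = {}
    for s, p, o in triples:
        if p == "kiao:basedOn":
            based_on.setdefault(s, []).append(o)

    seen = {start}
    queue = [start]
    order = []
    while queue:
        node = queue.pop(0)
        for source in based_on.get(node, []):
            if source not in seen:
                seen.add(source)
                order.append(source)
                queue.append(source)
    return order


# The two statement-element-level provenance assertions of Figure 4.
figure4_provenance = [
    ("ex:se6", "kiao:basedOn", "ex:ra7"),
    ("ex:se9", "kiao:basedOn", "ex:se6"),
]

print(provenance_chain(figure4_provenance, "ex:se9"))
```

Starting from se9, the sketch first reaches se6 and then ra7, mirroring the chain described for Figure 4.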

Discussion 

Use cases 

In this section, we discuss types of tasks that our anno- 
tation model enables, along with specific examples of 
such tasks, including projects on which we are working 
as well as external efforts. 

Integrating different types of annotations 

A wide variety of annotation models and formats have 
been created for a wide range of tasks; however, the 
large majority of these are suited to one particular type 
of annotation or task and are not broadly applicable or 
interoperable. Our proposal is a generic metamodel of 
annotations and their linkage to their denoted know- 
ledge representations. As such, it is neutral with respect 
to annotation template (i.e., the choice of terminologies,
ontologies, or schemas used for annotation), the nature
of the denoted knowledge representations created, and
the methodology by which annotations and denoted
knowledge representations are created. As a result, our 
model can be generically used to integrate different types 
of annotations in a common representation, in turn 
leading to enhanced interoperability and queryability 
among these different types of annotations. Such inte- 
gration is of substantial interest to us in the context of 
our efforts with the Colorado Richly Annotated Full- 
Text (CRAFT) Corpus, a collection of full-text biomed- 
ical journal articles that we have extensively marked up 
with a wide range of types of annotations, including
those specifying mentions of biomedical concepts,
coreference, and discourse, as well as a variety of syntactic
annotations, including sentence segmentation, tokenization,
part-of-speech tagging, and Penn TreeBank tagging [16,25].
Relying on our generic metamodel, we are able to 
represent these disparate types of annotations and their 
denoted knowledge representations in a unified way. 
This, in turn, is required to enable matching of different 
types of annotations to various elements of formal 
natural-language patterns for automated understanding 
of biomedical text. This enables querying over multiple 
annotation types simultaneously, for example looking for 
all the noun-phrases that overlap with annotations to 
specific ontology terms, which might be a query for 
learning new vocabulary or patterns for identifying 
ontology terms in text. 

Text is not the only artifact for aggregating multiple 
types of annotation. Tools such as the UCSC Genome
Browser, which layer annotation tracks over a visualization
of the genome [26], demonstrate a clear need for
integrating various types of genomic annotation, such as dbSNP
[27], COSMIC [28], OMIM [29], and the GWAS catalog
[30]. Use of our model would enable such annotations to
be more easily integrated into such tools via a standard
representation and would further enable the integrated
annotations to be usable beyond the scope of specific tools
(e.g., visualization) for new queries yet to be envisioned by
researchers.



Connecting annotations to the Semantic Web

Many types of annotations are represented in ad hoc, idio- 
syncratic formats (e.g., Penn Treebank [15] for parts of 
speech, GAF 2.0 [31] for gene function, VCF [32] for 
SNPs) that, in addition to hindering interoperability, are 
obstacles to integration with the Semantic Web. We have 
implemented our annotation metamodel as a formal 
OWL ontology, and specific annotations are created as in- 
stances of relevant OWL classes. Consequently, linkage of 
these annotations and the data and knowledge they specify 
to existing ontologies [10], other RDF repositories [33], 
and the broader Semantic Web is considerably facilitated. 
Using our model, simultaneous reasoning over the structure of
annotations and their denoted semantics is also enabled
through the use of RDF- and OWL-based querying sys- 
tems. For example, it is possible to query for annotations 
that are based on an annotation from a specific source 
that mention a subclass of a specific ontology term. Such 
a query might be used if a specific annotator is known to 
be highly accurate or inaccurate at a certain task. Such 
queries cannot be easily performed, if they can be per- 
formed at all, using combinations of tools on more idio- 
syncratic formats. 

Linking annotations to arbitrarily complex knowledge
representations 

In most annotation efforts, each annotation consists of a 
basic association of one conceptual resource (e.g., an
ontology class, schema element, database identifier) with
one target (e.g., document, text span, database entry).
However, as information needs increase and annotation 
efforts expand to capture more complex information, 
more complex knowledge structures will need to be repre- 
sented. For example, although most GO annotations of 
genes/gene products straightforwardly link GO terms to 
database entries representing these genes/gene products, 
there has been a call to associate the biological-functionality 
annotations of the genes/gene products with more specific 
contexts, such as the types of cellular locations or cells 
in which these biological functionalities were observed 
[31]. Similar calls for representing complex structures 
in linguistic annotation are also being made, for ex- 
ample representing semantic frames composed from 
other annotations [34]. In our own work, the next layer
of annotation planned for the CRAFT corpus will also 
require the ability to dynamically construct concepts 
for use in assertional annotation [8]. To capture such 
information, annotations must point to knowledge struc- 
tures more complex than singular concepts. We have rep- 
resented two fundamental types of annotations: resource 
annotations, each of which points to a single RDF re- 
source, and graph annotations, each of which points to an 
RDF graph encapsulating one or more RDF statements. 
Using our model, a user can create any combination of re- 
source annotations and/or graph annotations, as moti- 
vated by the complexity of information that is sought to 
be captured in a given annotation effort. 

Documenting the composition of annotations 

As annotations become more complex, annotators
(both human and computational) need the ability to
refer to existing annotations as the basis of more complex
annotations. While ontologies and other vocabularies
used for annotation strive to be complete, it is likely
that specific applications will require dynamic construction
of concepts, either through data-driven methods
[7,14] or compositional concept formation [8].
Natural language processing (NLP) pipelines are frequently
composed of sequences of components that each
produce output based on the output of earlier annotation
layers. For example, the output of a tokenizer might be fed
into a part-of-speech detector, and the output of both might
then be fed into a named-entity recognizer, so that each
component is dependent on the output of some or all of the previous
components. There are generalized frameworks for building
annotation pipelines, such as UIMA [35,36]; however, these
pipelines do not provide standardized models for
documenting the compositional provenance of annotations.
Other systems for language understanding, such as Direct 
Memory Access Parsing systems [37,38] including Open- 
DMAP [39] and REDMAP [24], use hierarchical patterns 
to compose semantic annotations. These systems produce 
knowledge structures that are analogous to those presented 
in Figure 2; however, they have no standard methods for 
documenting provenance. All of these use cases are covered 
in a generalized way by our model, which enables the track- 
ing of both coarse-grained, annotation-level provenance 
and fine-grained, statement-element-level provenance. 

Understanding the genetic basis of disease is a major 
focus of current biological and bioinformatics research 
and requires the integration of numerous types of anno- 
tation. For example, epistasis captures interactions be- 
tween genes that affect function and phenotype, and 
compositional epistasis has been introduced [40] as a 
way to model multiple genes affecting a phenotype. A 
typical method for identifying epistasis starts with anno- 
tations of SNPs (single-nucleotide polymorphisms) and 
then applies a procedure for determining interactions 
among them [41,42]; the inputs to such procedures are 
SNP annotations and the outputs can be modeled as a 
higher-order annotation over the genome sequence that 
connects two or more of the SNP annotations. Captur- 
ing the dependency of the epistasis relation on the prior 
SNP annotation is important; SNP identification (variant 
calling) is a process that is dependent on the initial se- 
quencing and assembly, the reference genome, and other 
factors and as such the SNP annotations of a genome 
may vary with different analyses. Our model would en- 
able identifying and distinguishing epistasis relationships 
determined on the basis of one variant analysis from 
those based on another analysis performed under differ- 
ing conditions. 

Analyzing annotation errors and attributing blame 

A common use of provenance information is for error 
analysis and blame attribution tasks. For example, if an 
annotation is deemed incorrect, the method by which 
that annotation was constructed needs to be investi- 
gated. This often starts with identifying all the annota- 
tions that contributed to its generation. It is possible 
that a lower-level annotation is incorrect and its use
alone led to the larger annotation being incorrect. In the
case of DMAP-style pattern recognition, this type of
analysis is critical both for debugging during development
and for analysis of results such as that done
in the evaluation of REDMAP [24]. For example, in that
evaluation it was important to identify whether errors 
were due to named-entity recognizers improperly identi- 
fying entities in the text or to larger patterns being im- 
properly applied. Working in the opposite direction, if a 
lower-level annotation is deemed incorrect, it is impor- 
tant to identify all downstream annotations that are 
based on that annotation so that they too can be identi- 
fied as incorrect and retracted. (Please see later section 
titled "Querying Using the Model" for examples of 
SPARQL queries that extract this type of provenance in- 
formation using our model.) 
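
The downstream direction of this analysis can be sketched in a few lines of Python. This is only an illustration over string-tuple triples, not the SPARQL-based approach the paper describes; the function name affected_annotations is hypothetical, and the sketch assumes that each statement element belongs to one statement and each statement to one graph annotation.

```python
# Illustrative sketch only: finds annotations that depend on a suspect annotation.
ELEMENT_PROPERTIES = {
    "kiao:subjectElement",
    "kiao:propertyElement",
    "kiao:objectElement",
}


def affected_annotations(triples, suspect):
    """Return annotations with any part directly or indirectly based on suspect."""
    dependents = {}
    for s, p, o in triples:
        if p == "kiao:basedOn":
            dependents.setdefault(o, set()).add(s)

    # Collect everything downstream of the suspect via inverse basedOn links.
    tainted = set()
    stack = [suspect]
    while stack:
        node = stack.pop()
        for dependent in dependents.get(node, ()):
            if dependent not in tainted:
                tainted.add(dependent)
                stack.append(dependent)

    statement_of = {o: s for s, p, o in triples if p in ELEMENT_PROPERTIES}
    annotation_of = {o: s for s, p, o in triples if p == "obo:has_part"}

    result = set()
    for node in tainted:
        statement = statement_of.get(node)
        if statement is None:
            # Annotation-level provenance: the dependent is itself an annotation.
            result.add(node)
        else:
            result.add(annotation_of.get(statement, statement))
    return sorted(result)


figure4 = [
    ("ex:ga2", "obo:has_part", "ex:s2"),
    ("ex:s2", "kiao:objectElement", "ex:se6"),
    ("ex:se6", "kiao:basedOn", "ex:ra7"),
    ("ex:ga3", "obo:has_part", "ex:s3"),
    ("ex:s3", "kiao:objectElement", "ex:se9"),
    ("ex:se9", "kiao:basedOn", "ex:se6"),
]

print(affected_annotations(figure4, "ex:ra7"))
```

If ra7 were found to be incorrect, this sketch would flag both ga2 and ga3 for review, since se9 of ga3 depends on se6 of ga2.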

Efficiency of the model 

Modeling statement-element-level provenance comes 
with the cost of reifying the statements and statement 
elements. In the worst case, this cost is 10 triples per 
statement in the graph annotation, or 14 triples if infer- 
able type triples are explicitly represented as well: 7 tri- 
ples are required to reify the statement, and 1 triple is 
needed each for the subject, property, and object of the 
statement to record its provenance. (However, in our
experience, it is rare to record provenance for the
property.) For example, as rendered in Figure 4, ga1 requires
2 triples for the annotation (3 with type information), 4
triples to record annotation-level provenance, and 9
triples (13 with types) to record the statement-element-level
provenance. In contrast, in the OA model [3], it
takes 7 triples per text span to anchor an annotation to
a piece of text. If the text has multiple spans, there is an
additional 2-triple overhead. Connecting ga1 to text in
the OA model requires 16 triples: 7 triples for each of
the two text spans plus 2 triples of overhead for having
multiple spans. Table 1 shows the number of triples
required to model the example graph annotations used in
this document and their provenance. It also shows the
number of triples required to anchor these annotations
to their corresponding spans of text using the OA
model. In terms of counts of RDF triples required, it can
be seen that recording statement-element-level
provenance is comparable to associating annotations with text.

Table 1 Counts of triples required to represent annotations

Annotation   Annotation   Annotation    Statement element   Text span
             triples      provenance    provenance          triples
                          triples       triples
ga1          2 (3)        4             9 (13)              16
ga2          3 (4)        2             16 (24)             16
ga3          6 (7)        3             32 (48)             30

Counts of triples required to represent annotations (and total counts including
inferable type triples), their statement-element-level provenance, and their
associated text spans.
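
The per-statement arithmetic described above can be checked with a short Python sketch. The function name and its parameters are assumptions introduced for this illustration; the constants (7 reification triples and 4 inferable type triples per statement, plus 1 triple per recorded basedOn link) are read off the description of the worst case and of ga1 given in the text.

```python
# Illustrative sketch only: reproduces the triple counts quoted in the text.
def element_provenance_triples(n_statements, n_based_on, include_types=False):
    """Count triples needed for statement-element-level provenance.

    Per statement: 1 has_part link, 3 element links, and 3 denotes triples.
    Per recorded provenance link: 1 basedOn triple.
    Inferable types: 1 for the statement plus 3 for its elements.
    """
    reification = 7 * n_statements
    provenance = n_based_on
    types = 4 * n_statements if include_types else 0
    return reification + provenance + types


# Worst case for a single statement: provenance for all three elements.
print(element_provenance_triples(1, 3), element_provenance_triples(1, 3, True))

# ga1: one statement with provenance recorded for se1 and se2.
print(element_provenance_triples(1, 2), element_provenance_triples(1, 2, True))
```

For one statement with all three elements traced, the sketch gives 10 triples (14 with types); for ga1, with two elements traced, it gives 9 (13 with types), in line with the figures quoted above.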

We aimed in this work to lay down a low-level foun- 
dation for annotation provenance, which can then serve 
as the building blocks for higher-level models or 
axiomatization. We acknowledge that the use of reifica- 
tion to explicitly identify all of the low-level parts in 
our model leads to the production of additional 
triples. However, as more support for reasoning with ax- 
iomatizations in triple stores becomes available, exten- 
sions and abstractions of our model that reduce the 
counts of triples required could be defined. For example, 
if triples are reused from one graph to another, as is the 
case for statements s2 and s3 in Figure 4, a relation 
such as copyOf could be defined and used to directly 
connect these statements so as to obviate the need to 
reify and relate all of the corresponding RdfState- 
mentElement instances from both statement triples. 
As the model is applied in practice, other patterns may 
emerge that point to additional optimizations or refine- 
ments. Mappings could also be constructed from our 
model to nanopublications [43,44] or RDF Molecules [45] 
to potentially reduce the number of redundant triples. 

Querying using the model 

Due to the manner in which the model was integrated with 
the Relation Ontology (RO) and the Information Artifact 
Ontology (IAO), querying of the model can be quite
straightforward in SPARQL. A common provenance- 
tracking task might be to identify all the annotations that 
are based on a given annotation that is suspected of being 
incorrect. For example, the following SPARQL 1.1 query 
returns all of the annotations that are based on ra7: 

prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
prefix kiao: <http://kabob.ucdenver.edu/iao/>
prefix obo: <http://purl.obolibrary.org/obo/>
prefix ex: <http://www.example.org/>
select distinct ?a where {
  ?a obo:has_part/kiao:basedOn ex:ra7 .
  ?a rdf:type kiao:Annotation . }

results:

| ex:ga2 |
| ex:ga3 |

Similarly, a researcher may be interested in all of the 
statements that are based on a given annotation. If these 
RdfGraphAnnotation instances have had their
statement-element-level provenance recorded using our
model, such statements could be queried for directly. For
example, the following SPARQL 1.1 query would retrieve
all statements each of which has at least one statement
element that is based on resource annotation ra7:

prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
prefix iao: <http://kabob.ucdenver.edu/iao/>
prefix kiao: <http://kabob.ucdenver.edu/iao/>
prefix obo: <http://purl.obolibrary.org/obo/>
prefix kro: <http://kabob.ucdenver.edu/ro/>
prefix ex: <http://www.example.org/>
select distinct ?statement ?subject ?property ?object where {
  ?statement obo:has_part/kiao:basedOn ex:ra7 .
  ?statement rdf:type kiao:RdfStatement ;
    kiao:subjectElement/iao:denotes ?subject ;
    kiao:propertyElement/iao:denotes ?property ;
    kiao:objectElement/iao:denotes ?object . }

results:

| ?statement | ?subject | ?property                    | ?object          |
| ex:s2      | ex:Pl    | kro:results_in_regulation_of | obo:PR_000001933 |
| ex:s3      | ex:Pl    | kro:results_in_regulation_of | obo:PR_000001933 |

The first three namespaces are needed as part of the 
annotation model, and the last three are specific to the 
example domain and only used for more conveniently 
rendering the results. While this query and the one be- 
fore are modeled using SPARQL 1.1 property paths, 
there is nothing in our model that requires them for 
querying. For example, using SPARQL 1.0, the property 
paths could be expanded using blank nodes or variables 
that are not captured in the results. 

Related work 

Efforts in the representation of more structured anno- 
tations have tended to be idiosyncratic, specific to a 
particular type of annotation or task, and not broadly 
interoperable. For example, for the task of Gene Ontology 
(GO) annotation, in which the functionalities of genes and 
gene products represented in biomedical databases are as- 
sociated to GO terms [46], the Gene Association File 
format (GAF 2.0) [31] enables the representation of con- 
straints on the context in which a given annotation might 
be valid (e.g., the type of cell in which the functionality is
asserted to be present); however, this format is specific to 
this narrow task. Analogously, the corpus and computa- 
tional linguistics communities have developed solutions 
for representing complex syntax and semantics for docu- 
ments, e.g., the Penn Treebank format [15], but these 
representations are mostly idiosyncratic and not interoper- 
able. The Linguistic Annotation Framework (LAF) [34], 
along with its XML-based serialization GrAF [47] and
the RDF-based representation DADA [48], allow for the
markup of a wider range of linguistic phenomena, but they 
only permit the specification of functional (single-valued) 
properties. POWLA [49] is another linguistic corpora an- 
notation formalism based in RDF and OWL, but, like 
LAF/GrAF, it introduces an ambiguity by using the same 
identifier to anchor information about both the annotation 
and what it denotes; hence, the formalism cannot clearly 
capture which information applies to the annotation and 
which applies to the denoted knowledge [50]. 

Several prominent efforts have focused on general rep- 
resentations of annotation, notably the Open Annotation 
(OA) model [3] and the Annotation Ontology (AO) [4]. 
These models represent three basic pieces of informa- 
tion for a given annotation: the thing being annotated 
(e.g., a span of text), the denoted knowledge representation
(i.e., the concept or set of assertions denoted by the
annotation), and the annotation itself, which connects 
the other two. These three things can be broadly aligned 
across the two models as well as with our model for 
the linkage of annotations to their denoted knowledge 
representations. A proposed integration of our model with 
the Open Annotation model can be found in Additional 
file 1 and an integration with the Annotation Ontology 
model in Additional file 2. In this paper, we have elided 
discussion of metadata such as author and creation date 
as well as the connection of annotations to their respective
targets, and our annotation model makes no constraints 
or requirements as to how these pieces of information are 
represented. The relations used by the Open Annotation 
model and the Annotation Ontology would both work 
well, and for most of such metadata the two models 
largely capture the same details. Though constructs that 
can denote complex knowledge structures have very re- 
cently been added to these models, there have been no 
mechanisms put forth by which complex annotations can 
be composed of more atomic annotations with their prov- 
enance unambiguously recorded. 

Both the OA and AO annotation models seem to sup- 
port one annotation pointing to multiple targets. However, 
it is ambiguous as to whether the annotation applies 
equally and independently to each target (e.g., as for an
annotation targeting the text spans of multiple mentions of
"STAT6" in a piece of text with the corresponding Protein
Ontology class (PR:000001933)) or if it is the union of
the targets that is being annotated (e.g., as for an
annotation targeting the two discontinuous spans of text
"c-terminal" and "tails" from the phrase "c-terminal cytoplasmic
tails" with the Sequence Ontology class c_terminal_region
(SO:0100015) [51]). We strongly assert that an
annotation with multiple targets should be interpreted as 
a single discontinuous annotation and that the alternate 
shared-annotation interpretation should be disallowed by 
all models. On its surface, the shared-annotation
interpretation seems to be beneficial in that it saves triples
and seems easier to create. However, it muddles informa- 
tion represented for the purposes of provenance tracking 
or error analysis; for example, if three out of four of the 
shared targets for an annotation are correct, but the fourth 
target is incorrect, this information could not be accu- 
rately captured. Furthermore, in the case of compositional 
annotations, it could not be clearly represented which of 
the shared annotations and targets are connected via 
basedOn links and which are not. When considering in- 
creasingly complex annotations and how annotations will 
be used by downstream applications and models, it is clear 
that one annotation to one Annotation instance is the 
only lossless approach. 

Numerous other models of triple-level provenance also 
exist, for example, PaCE [52] and RDF coloring [53], but 
these models require more complicated URI-minting pro- 
cedures and systems that can understand the composi- 
tional URIs they produce. The most related triple- 
provenance model is that for nanopublications [43,44], 
which is compatible with our GraphAnnotation model 
in that it provides a method for recording triple-level 
provenance and annotating sets of triples with metadata. 
However, the primary purpose of nanopublications is to 
enable attribution and validation of scientific statements, 
and as such it does not model resource annotations, tar- 
geting annotations to other content such as text, or fine- 
grained statement-element-level provenance. Our approach 
is complementary to microattribution proposals to attribute 
data such as disease-implicated genetic variants to the
scientists who determine them [54].

Also related to our research is work being done by the 
scientific workflow provenance community. Proof Markup 
Language [55] models the justifications of reasoning results 
from Semantic Web services, while work such as Provair 
[5] aims to document work-flow provenance. Trust and 
authenticity are also active areas of research [56]. Through 
provenance workshops [57] and challenge meetings [58], 
the Open Provenance Model (OPM) [59] has been devel- 
oped. Other community efforts have led to the creation of 
the PROV [60] model, which provides a data model for 
building representations of the entities, people and pro- 
cesses involved in producing a piece of data or thing. A 
proposed integration of our provenance relations with the 
object-centric portion of PROV-O (the OWL-ontology 
version of PROV) [61] is provided in Additional file 4. 

Conclusions 

We have presented a model for representing compositional 
annotations and annotation provenance, and provided ex- 
amples of application areas for the model. The model can 
be used to link annotations to their denoted knowledge 
representations, and we have divided the annotation space 
into resource annotations, in which RDF resources are used
to annotate targets, and graph annotations, in which graphs
composed of one or more RDF triples are used. With this 
model, progressively more complex annotations can be
composed from other annotations, and the provenance of
these compositions can be unambiguously represented at either
a coarse- or fine-grained level. We have designed our annotation model to
be generic so as to facilitate the concurrent use of multiple 
types of annotations. Additionally, it allows for the creation 
of arbitrarily complex annotations, both in terms of their 
denoted knowledge and of any other annotations upon 
which they rely. All of this information can be losslessly re- 
corded, thus facilitating inference and error tracking in 
large computational annotation efforts. We have provided 
an OWL representation of our model integrated with the 
Information Artifact Ontology, as well as proposed integra- 
tions with the Open Annotation model, the Annotation 
Ontology, and the PROV Ontology. 
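
The distinction between resource annotations and graph annotations, and between coarse- and fine-grained provenance, can be illustrated with a minimal Python sketch. It is not the OWL model itself: the class names, field names, the `ex:` identifiers, and the target strings are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


@dataclass(frozen=True)
class ResourceAnnotation:
    """A single RDF resource used to annotate a target (illustrative only)."""
    ident: str
    target: str    # hypothetical target identifier, e.g. a text span
    resource: str  # hypothetical identifier of the denoted resource


@dataclass
class GraphAnnotation:
    """One or more RDF triples used to annotate a target (illustrative only)."""
    ident: str
    target: str
    triples: Tuple[Triple, ...]
    # Coarse-grained provenance: annotations this one was composed from.
    based_on: Tuple[str, ...] = ()
    # Fine-grained provenance: (triple index, element position) -> source id.
    element_sources: Dict[Tuple[int, str], str] = field(default_factory=dict)

    def source_of(self, triple_index: int, position: str) -> Optional[str]:
        """Return the annotation that supplied one triple element, if recorded."""
        return self.element_sources.get((triple_index, position))


# Two resource annotations are composed into one graph annotation.
gene = ResourceAnnotation("ann1", "doc1#chars=0-4", "ex:GeneX")
process = ResourceAnnotation("ann2", "doc1#chars=15-24", "ex:ProcessY")
composed = GraphAnnotation(
    ident="ann3",
    target="doc1#chars=0-24",
    triples=(("ex:GeneX", "ex:positivelyRegulates", "ex:ProcessY"),),
    based_on=(gene.ident, process.ident),
    element_sources={(0, "subject"): gene.ident, (0, "object"): process.ident},
)

print(composed.based_on)                   # annotation-level provenance
print(composed.source_of(0, "subject"))    # element-level provenance
print(composed.source_of(0, "predicate"))  # no source recorded for this element
```

In this sketch, `based_on` plays the role of annotation-level provenance, while `element_sources` records which earlier annotation contributed each subject, predicate, or object, which is the kind of detail that supports tracing an error back to the annotation that introduced it.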

Endnotes

Although activation (mentioned in the example sentence
fragment in Figure 2) is semantically narrower
than positive regulation, we use the GO class positive
regulation of biological process here
for simplicity, as there is no more specific subclass in
the GO that generically represents the activation of a
biological process. Similarly, inhibition is semantically
narrower than the GO class negative regulation
of biological process.

The OBO Relation Ontology, upon which the ontol- 
ogies of the OBO library rely, uses the obo: namespace. 
We extend the Relation Ontology using the namespace 
kro:. 

While relations are typically named as verbs or verb 
phrases, we modeled these relation names to be analo- 
gous to the core RDF statement model. 

Additional files 



Additional file 1: Appendix A. Alignment with the Open Annotation 
Model. Appendix describing an alignment from the proposed model to 
the Open Annotation model. 

Additional file 2: Appendix B. Alignment with the Annotation 
Ontology. Appendix describing an alignment from the proposed model 
to the Annotation Ontology model. 
extension of the IAO.

Additional file 4: Appendix C. Alignment with the PROV Ontology. 
Appendix describing an alignment from the proposed model to the 
PROV OWL model. 